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Abstract 

Background: Several methods have been proposed to account for multiple comparisons in genetic association 
studies. However, investigators typically test each of the SNPs using multiple genetic models. Association testing 
using the Cochran-Armitage test for trend assuming an additive, dominant, or recessive genetic model, is commonly 
performed. Thus, each SNP is tested three times. Some investigators report the smallest p-value obtained from the 
three tests corresponding to the three genetic models, but such an approach inherently leads to inflated type 1 errors. 
Because of the small number of tests (three) and high correlation (functional dependence) among these tests, the 
procedures available for accounting for multiple tests are either too conservative or fail to meet the underlying 
assumptions (e.g., asymptotic multivariate normality or independence among the tests). 

Results: We propose a method to calculate the exact p-value for each SNP using different genetic models. We 
performed simulations, which demonstrated the control of type 1 error and power gains using the proposed approach, 
We applied the proposed method to compute p-value for a polymorphism eNOS -786T>C which was shown to be 
associated with breast cancer risk. 

Conclusions: Our findings indicate that the proposed method should be used to maximize power and control 
type 1 errors when analyzing genetic data using additive, dominant, and recessive models. 

Keywords: Genetic association, Multiple testing, Cochran-Armitage trend test, Genetic models, Exact p-value 



Background 

Genome-wide association studies (GWAS) and candi- 
date gene association studies are commonly performed 
to test the association of genetic variants with a particular 
phenotype. Typically, hundreds of thousands of single- 
nucleotide polymorphisms (SNPs) are tested for associ- 
ation in these studies. Associations between the SNPs and 
the phenotypes are determined on the basis of differences 
in allele frequencies between cases and controls [1]. Sev- 
eral statistical methods have been proposed to control the 
family-wise error rate (FWER) for multiple comparison 
testing. 

A simple approximation can be used to obtain a FWER 
of a by utilizing the Bonferroni adjustment [2] of a* = ^ 
and using a* as the threshold for significance for each test. 
Bonferroni adjustment tends to be conservative when the 
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tests are correlated. In genetic association studies, the 
SNPs being tested are typically in linkage disequilibrium 
(LD), which leads to correlation among the tests. An alter- 
native approximation to the Bonferroni adjustment is 

Sidaks correction [3,4], a* = l-(l-a)" which assumes in- 
dependence among tests. Conneely and Boehnke [5] pro- 
posed a correction that does not assume independence 
among tests but assumes joint multivariate normality of 
all test statistics. Other methods to control the FWER in- 
clude using the false discovery rate (FDR) [6,7]. 

In genetic association studies, three genetic models- 
additive, dominant, and recessive-are generally used to 
test each SNP using the Cochran-Armitage (CA) trend 
test [8-12]. In association studies the true underlying 
genetic model is unknown. Some investigators report 
the smallest p-value obtained from the three tests corre- 
sponding to the three genetic models. However, such a 
procedure inherently leads to an inflated type 1 error 
rate. Also, FDR-based methods to control FWER are not 
applicable in this situation because the hypotheses are 
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highly correlated, as the same SNP is tested using differ- 
ent genetic models. 

Thus, there is a need to correct for multiple compari- 
sons corresponding to the three genetic tests performed 
for testing the association of a single SNP. These three 
tests are not only correlated but also functionally dependent. 
The standard methods for correcting for multiple testing 
referred to above are either too conservative or fail to meet 
the assumptions underlying these methods (e.g., asymptotic 
multivariate normality, independence among tests). Several 
approaches have been proposed to account specifically for 
the multiple comparisons of these three genetic models 
[13-15]. However, these approaches assume asymptotic tri- 
variate normality for the additive, dominant and recessive 
test statistics. While this is a reasonable approximation to 
correct for multiple comparisons, preliminary investigations 
regarding the joint distribution of the three test statistics re- 
vealed the following insights: 1) the joint distribution of the 
test statistics is discrete and the grids at which the probabil- 
ity mass function is positive is few and far between; 2) The 
distribution is highly multimodal in most of the situations, 
particularly, when the number of cases and controls are 
different and unimodal only in a handful of situations 
(e.g. when the number of cases and controls are equal). 
Therefore, we propose a method to compute the exact 
joint distribution of the three CA trend tests corre- 
sponding to the additive, dominant, and recessive gen- 
etic models. We used this joint distribution to compute 
the exact p-value for testing each SNP using the different 
genetic models. We performed simulations to demon- 
strate control of type 1 errors and power gains using the 
proposed approach. Finally, we applied the proposed ap- 
proach to assess the significance of the association be- 
tween a promoter polymorphism, eNOS-786T>C and 
breast cancer risk. 

Methods 

Consider a di-allelic SNP locus. The minor (deleterious) 
allele is labeled as a, and the major (normal) allele is la- 
beled as A The deleterious allele a is assumed to affect a 
phenotype Z, which takes the values of 0 or 1: Z = 1 indi- 
cates cases (affected) and Z = 0 indicates controls (un- 
affected). The observed genotype data for the SNP is one 
of three genotypes (A, A), (A, a), or (a, a). Let R x denote 
the number of cases and R Y denote the number of con- 
trols, with R x + R Y = N. Let X l9 X 2 , X 3 and Y v Y 2 , Y 3 be the 
number of individuals with genotypes AA, Aa, and aa in 
cases and controls, respectively. The data can be formu- 
lated in a 2 x 3 contingency table, as shown in Table 1. Let 
Pi>Pi>P3 De the frequencies of genotypes, AA, Aa and aa 
in cases and q lt q 2 , q 3 be the frequencies of these three ge- 
notypes in controls. The values of p b q b i= 1,2,3 can be es- 
timated from the data as p t = Jp- and q l — Jp- . 



Table 1 Genotypic counts, parameterizations, and 
notations for various parameters used in the model 
formulation 



Genotype 





AA 


Aa 


aa 


Sum 


Cases (X) 


x, 


x 2 


x 3 


Rx 


Controls (V) 






Y 3 


Ry 


Sum 


Ci 


c 2 


Q 


N 



There have been many approaches in the literature for 
testing the association between a SNP and disease status. 
The CA test for trend [8] is generally the most popular 
and is available in most genetic analysis software pack- 
ages, such as PLINK [16]. The test statistic for the CA 
test is as follows: 

3 

W = ^Ji(R Y Xi-RxYi), 

i=l 

where the weight, t b is chosen on the basis of the genetic 
model considered: additive, dominant, or recessive. The 
additive model assumes the deleterious effect is linearly 
related to the number of deleterious alleles. The domin- 
ant model assumes the deleterious effect is related to the 
presence of the deleterious allele. And the recessive model 
assumes the deleterious effect is related to the presence of 
both the deleterious alleles. The weights t=[t ll t 2 , t 3 ] cor- 
responding to each of the models are as follows: additive 
model: t = [0, 1, 2], dominant model: t = [0, 1, 1], and reces- 
sive model: t = [0, 0, 1] for genotypes AA, Aa, and aa, re- 
spectively. Let the three test statistics corresponding to 
the additive, dominant, and recessive models be T 1} T 2 , 
and T 3 , respectively. 

The joint distribution 

Each test statistic, T lf T 2 and T 3 , has an asymptotically 
normal univariate distribution. Therefore, the p-values 
for each of these tests can be obtained from their 
asymptotic distributions. However, reporting the smal- 
lest p-value obtained from testing T lf T 2 and T 3> indi- 
vidually leads to an inflated type 1 error rate. If the 
exact joint distribution of the three tests is known, one 
can compute the exact p-value for the SNP that will 
account for the multiple correlated tests. We proceed 
to derive the joint distribution of the three test sta- 
tistics, T x = (R Y X 2 -R X Y 2 ) + 2(R Y X 3 - R X Y 3 ), and T 2 = 
(R Y X 2 -R X Y 2 ) + (R Y X 3 -R X Y 3 ), and T 3 = (R Y X 3 - R X Y 3 ). 
As T 3 = Ti- T 2 , we only need to derive the joint distri- 
bution of Ti and T 2 . It is reasonable to assume that 
the three genotype counts in cases (X l7 X 2 , X 3 ) and the 
three genotype counts in controls (Y 1 ,Y 2) Y 3 ) follow a 
multinomial distribution, with probabilities (pi,p 2 ,p 3 ) 
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and (q lf q 2 , #3) respectively. Let T 



,X 



X 2 

x 3 



and Y 



The test statistics can be written as 



~Ry 


2R Y ' 


and B = 


-Rx 


-2R X ~ 


Ry 


Ry 




_-Rx 


-Rx _ 



T = AX + BY, where A 

Then the joint probability mass function (pmf ) of T lf T 2 is 
given by 

r x R x -x 2 

f T {T u T 2 ) = ^2fx( x 2^s)f Y (h(X 2 ,X 3 ,T u T 2 )) 

x 2 =0X3=0 

where f x ,f y are trinomial probability mass functions and 
h(X ) T)=B~ 1 T-B~ 1 AX. The derivation of the joint pmf 
of Ti, T 2 is detailed in the Appendix. The p-value corre- 
sponding to the test statistic (t v t 2 ) can be computed by 
summing up the probabilities of the test statistics that 
are equally or less probable than the observed test statis- 
tic, which can be written as 

j2Y/at u t 2 ) 

pvalue(t\^ t<i) = T i T 2 

(r 1 ,r 2 :/ r (r 1 ,r 2 )</ r (^,t 2 )> 

The computation of the p-value using the above for- 
mula is nontrivial; however, there are a variety of com- 
putational optimizations and parallels to Fishers exact 
test that can be used to drastically reduce the computa- 
tional complexity (see details in the Appendix). Briefly, 
the CA trend test statistics form a system of constrained 
linear Diophantine equations. The computational opti- 
mizations presented in the Appendix are based on 



exploiting the properties of the linear Diophantine 
equations with trinomial constraints. The solution space 
of these equations corresponds to the discrete space of 
nonzero probabilities for the joint pmf. This discrete space 
has a pattern of overlapping triangles that can be enumer- 
ated based on R x and R Y counts (See Figures 1, 2, 3 and 4). 
To reduce the number of computations in the discrete 
space we first transformed the test statistics to be symmet- 
ric. The pattern of overlapping triangles depends on three 
different scenarios based on the greatest common divisor 
(GCD) of R x and Ry. 1. GCD(R X , R Y ) = 1, 2. GCD(R X , R Y ) = 
R x = R Y and 3. 1 < GCD(R X , R Y ) < mm(R x , Ry)- m scenario 1 
the triangles do not overlap, therefore the p-value can be 
evaluated most efficiently (Figures 1 and 2). In scenario 2 
most of the triangles overlap and the discrete space of non- 
zero probabilities is sparse (Figure 4). In this scenario, we 
proposed an algorithm to exploit this aspect to calculate 
the exact p-value more efficiently. Scenario 3 is the most 
general case which uses the general optimizations of sym- 
metricity and the triangle pattern (Figure 3). The algo- 
rithms to compute the exact p-values for each of the 
scenarios are detailed in the Appendix. 

Simulations 

We performed simulations to evaluate the performance of 
the proposed method and compared our approach with 
standard approaches used in the literature. All the simula- 
tion results were based on 1000 replicate data sets. Each 
replicate dataset comprised 1000 cases and 1000 controls. 
The disease status for each data set was obtained using 
the logistic regression model logit(P(Z = 1)) = fi 0 + /^X, 
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where X is the indicator for genotype, Z is the disease 
status, /?o is the intercept, and /?i is the log odds ratio 
for the SNP. The genotype data for a SNP were simu- 
lated using a minor allele frequency (MAF) of 40% for 
the null hypothesis and two MAFs of 40% and 20% for 
the power comparisons. For the type 1 error compari- 
sons, we simulated 1000 replicate datasets from the 
null hypothesis (i.e., the SNP was not associated with 



disease status), with /? 0 = - 2.5 and pi = log (1). For the 
power comparisons, we simulated 1000 replicate data- 
sets for 40% and 20% MAFs from the alternate hypoth- 
esis (i.e., the SNP was associated with disease status) 
for each of the three scenarios: (1) additive model with 
odds ratio of 1.2, (2) dominant model with odds ratio of 
1.3, and (3) recessive model with odds ratio of 1.3. The 
methods we compared were as follows: performing only 
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additive analyses (additive-only), performing only domin- 
ant analyses (dominant-only), performing only recessive 
analyses (recessive-only), using the p-value based on report- 
ing the smallest p-value of the three genetic models (min- 
p), using the Bonferroni correction approach, and using the 
proposed exact p-value method. 

Results 

The type 1 errors based on 1000 replicates from the null 
hypothesis are shown in Table 2. Analyses based on 
additive-only, dominant-only, and recessive-only models 
gave empirical type 1 errors of 0.044, 0.045, and 0.056, 
respectively, at the 0.05 level of significance. As expected, 
these models provided good control of type 1 errors be- 
cause only one genetic model was tested in these analyses. 
The Bonferroni approach also had a well- controlled, but 

Table 2 Type 1 error comparisons for different 
approaches at the 0.05 level of significance for 1000 
replicates, each replicate representing a data set 
containing 1000 cases and 1000 controls 



Method a =0.05 



Additive Only 


0.044 


Dominant Only 


0.045 


Recessive Only 


0.056 


Min-p 


0.105 


Bonferroni 


0.030 


Exact p-value 


0.047 



Min-p: p-value based on reporting the smallest p-value of the three 
genetic models. 



conservative, type 1 error (0.030 at the 0.05 level of signifi- 
cance). The min-p had a type 1 error of 0.105 at the 0.05 
level of significance, which was very liberal and confirmed 
that the minimum p-value of the three genetic models is 
not a valid test. Finally, our proposed approach provided 
good control of the type 1 error (0.047 at the 0.05 level of 
significance). 

The power comparisons based on 1000 replicates for 
the SNP data simulated using 40% and 20% MAFs for 
the three scenarios when the data were simulated using 
the additive, dominant, and recessive models, respect- 
ively, are shown in Table 3. The top and bottom panels 
of Table 3 depict the results for 40% and 20% MAFs, re- 
spectively. The min-p model was excluded from the com- 
parison because of its inflated type 1 error. When the data 
were simulated using the additive genetic model (column 3, 
Table 3), and were analyzed using only the additive model, 
it had the highest powers (0.816 and 0.656 for 40% and 
20% MAFs, respectively). However, when the data were an- 
alyzed using only the dominant model, the powers were 
0.676 and 0.603 for 40% and 20% MAFs, respectively. Also, 
when the data were analyzed using only the recessive 
model the powers were 0.588 and 0.306 for 40% and 20% 
MAFs, respectively. The powers for the additive only ana- 
lysis were the highest as expected because the true simula- 
tion model in this scenario was additive. However, the true 
model of disease inheritance is generally unknown and one 
performs analyses using all three genetic models. In this 
scenario, the proposed exact p-value method had powers of 
0.743 and 0.584 for 40% and 20% MAFs, respectively, at 
the 0.05 level of significance, which were higher than the 
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Table 3 Power comparisons for different approaches at the 0.05 level of significance for 3 different simulation 
scenarios using genotypes coded as additive, dominant, and recessive, respectively, for 40% and 20% MAFs 

Genotype model 



Additive model Dominant model Recessive model 

MAF Method 

Odds ratio = 1 .2 Odds ratio = 1 .3 Odds ratio = 1 .3 



Additive Only 


0.816 


0.660 


0.410 


Dominant Only 


0.676 


0.803 


0.116 


40% Recessive Only 


0.588 


0.158 


0.589 


Bonferroni 


0.721 


0.671 


0.452 


Exact p-value 


0.743 


0.726 


0.517 


Additive Only 


0.656 


0.774 


0.116 


Dominant Only 


0.603 


0.823 


0.061 


20% Recessive Only 


0.306 


0.102 


0.249 


Bonferroni 


0.556 


0.715 


0.168 


Exact p-value 


0.584 


0.782 


0.197 



The results for each panel are based on 1000 replicates, with each replicate representing a data set containing 1000 cases and 1000 controls. MAF: Minor 
allele frequency. 



Bonferroni method which had powers of 0721 and 0.556 
for 40% and 20% MAFs, respectively. Overall, powers of 
the proposed method were lower than additive model (true 
simulation model) but higher than those of the dominant- 
only, recessive-only, and Bonferroni correction approach. 

When the data were simulated using the dominant 
model (column 4, Table 3), the additive-only, dominant- 
only and recessive-only analyses had powers of 0.660, 
0.803, and 0.158, respectively, for 40% MAF and 0.774, 
0.823, and 0.102, respectively for 20% MAF, at the 0.05 
level of significance. Once again, as expected, the powers 
of the dominant-only analysis were the highest because 
the data were generated using the dominant model. The 
proposed exact p-value method had powers of 0.726 and 
0.782 for the 40% and 20% MAFs, respectively, which 
were higher than the Bonferroni method which had 
powers of 0.671 and 0.715 for the 40% and 20% MAFs, 
respectively. When the data were simulated using the re- 
cessive model (column 5, Table 3), the additive- only, 
dominant-only and recessive-only analyses had powers 
of 0.410, 0.116, and 0.589, respectively, for 40% MAF 
and 0.116, 0.061, and 0.249, respectively, for 20% MAF. 
The proposed exact p-value method had powers of 0.517 
and 0.197 for the 40% and 20% MAFs, respectively, which 
were higher than the Bonferroni method (0.452 and 0.168 
for 40% and 20% MAFs, respectively). 

We applied the proposed approach to assess the sig- 
nificance of the association between the promoter poly- 
morphism eNOS -786T>C and sporadic breast cancer 
risk in non-Hispanic white women younger than 55 years 
from a breast cancer study performed by [17]. The study 
discovered that eNOS -786T>C was statistically significant 
for breast cancer (p=0.017) and included 421 breast can- 
cer cases and 423 cancer free controls. The first panel in 



Table 4 depicts the genotype counts for TT, CT and CC 
genotypes in cases and controls for the eNOS -786T>C. 
The second panel in Table 4 reports the p-values for the 
eNOS -786T>C computed using the 5 different ap- 
proaches: additive-only, dominant-only, recessive-only, 
Bonferroni and the proposed exact p-value method. 
The additive-only, dominant-only and recessive-only 
approaches had p-values of 0.0045, 0.0148 and 0.0313, 
respectively, and the Bonferroni adjusted p-value was 
0.0135. For this SNP, the p-value computed using the 
proposed exact p-value method was 0.0021, which was 
more significant than the smallest of the three p-values 
obtained using the additive-, dominant-, and recessive- 
only analyses (Table 4). 

Discussion 

In this paper, we proposed a method to calculate the 
exact p-value for testing a single SNP using multiple 
genetic models. We recommend using the proposed 
method to maximize power and control type 1 errors 
when analyzing genetic data using additive, dominant, 
and recessive models. The proposed method is robust to 
model misspecifications and different SNP minor allele 
frequencies. Furthermore, similar to the computation of 

Table 4 P-values computed using various approaches for 
association of eNOS -786T> C with breast cancer 



Genotype Data for eNOS -786T> C Method p-value 





Controls 


Cases 


Additive Only 


0.0045 


Total 


423 


421 


Dominant Only 


0.0148 


TT 


203 


167 


Recessive Only 


0.0313 


CT 


185 


200 


Bonferroni 


0.0135 


CC 


35 


54 


Exact p-value 


0.0021 
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Fisher s exact p-value, the proposed approach does not 
depend on asymptotic distributions. 

In our simulation study, where replicate datasets were 
simulated using the null hypothesis, we found that the 
proposed method had well-controlled type 1 error prob- 
abilities. In contrast, the method of reporting the smal- 
lest p-value of the three genetic models tested had the 
highest false-positive rate and was found to be invalid. 
And, as expected, the type 1 error of the Bonferroni cor- 
rection approach was well controlled but conservative, 
which typically led to a loss in power for identifying gen- 
etic variants. 

We also simulated replicate datasets under an alterna- 
tive hypothesis using the different genetic models: addi- 
tive, dominant, and recessive. In these simulations, we 
observed that no single method: additive-only, dominant- 
only, or recessive-only, had higher power in all three sce- 
narios. Each of these methods had higher power only 
when the model used to analyze the data was the same as 
the true model used to generate the data. However, be- 
cause the true mode of disease inheritance is usually 
unknown, analyses using all three genetic models are 
necessary. In general, the Bonferroni correction ap- 
proach led to higher power than using a model that 
did not correspond to the true model. The proposed 
exact p-value method was an improvement over the 
Bonferroni method. The conservativeness of the Bonfer- 
roni method may be due to its inability to account for the 
functional dependence between the three test statistics. In 
contrast, our proposed approach accounts for this func- 
tional dependence by computing p-values from the joint 
probability mass function. Finally, we analyzed breast can- 
cer study data in which the polymorphism eNOS -786T>C, 
was found to be significant [17]. 

The computation time needed to obtain the exact 
p-value is substantial. The problem is very closely related 
to Fisher s exact test, and there are many patterns inher- 
ent in the structure of the problem that could be exploited 
to calculate the p-values more efficiently. In the Appendix, 
we present several novel optimization techniques to effi- 
ciently compute the test statistics in a reasonable time 
(e.g., approximately 15 min for a 1000 cases and 1000 
controls dataset). The software to compute exact p-values is 
available at http://odin.mdacc.tmc.edu/~rtalluri/index.html. 

Conclusions 

In genetic association studies, three genetic models-additive, 
dominant, and recessive-are generally used to test each SNP 
using the Cochran- Armitage trend test. Reporting the mini- 
mum p-value of the three genetic models leads to inflated 
type 1 errors. We proposed an approach to compute the 
exact p-value when genomic data is analyzed using the three 
genetic models. The proposed approach leads to higher 
power while controlling the type 1 error. 



Appendix 

Optimization techniques for computing the exact p-value 

Recall that X lf X 2 , X 3 and Y lt Y 2 , Y 3 are the number of in- 
dividuals with genotypes AA, Aa, and aa in cases and 
controls, respectively, with X x + X 2 + X 3 = R x and Y 1 -\-Y 2 -\- 
Y 3 = R Y . The three genotype counts in cases (X lt X 2 , X 3 ) 
and the three genotype counts in controls (Yi, Y 2 , Y 3 ) follow 
a multinomial distribution with probabilities (p\,p 2 ,p 3 ) 
and (q lf q 2 , q 3 ), respectively. The probability mass function 
(pmf) of {X h X 2 , X 3 ) is f x (X) = x ^l X2 _ Xi) , p R r X2 ~ Xz 
p x Spf and the pmf of (Y lt Y 2 , Y 3 ) \sf Y (Y) = y^^'y^, 

q^ Y ~ Y2 ~ Y3 q Y2 q Y3 . The three test statistics corresponding 
to the additive, dominant, and recessive models are, 
T x = (R Y X 2 - R X Y 2 ) + 2(R Y X 3 - R X Y 3 ) , T 2 = (R Y X 2 - R X Y 2 ) + 
(R Y X 3 -R X Y 3 ) } and T 3 = (R Y X 3 - R X Y 3 ) respectively. As 
T 3 - 7\ - T 2 , we only need to derive the joint distribution 

The test statistics can be written as T = AX + BY, where 



of Ti and T 2 . Let T 



~Ry 


2R y ~ 


and B = 


'-Rx 


-2Rx~ 


R y 


R y 


-R x 


-Rx . 



. We proceed to 



derive the joint probability mass function of T 



Ti 

T 2 



Consider an n-dimensional discrete random vector G 
with pmf/ G (). Suppose we have a transformation from 
G —*H. The pmf f H () of the transformed variables H can 
be expressed as follows: [18] 

IhW =fe{0- 1 (H)) 

This can be extended to the case where the dimen- 
sions of G and H are different, i.e., the transformation 
from (X, Y) — > T is a linear transformation of the form 
T - AX + BY. The pmf of T is given by 

fr(T) = YsfxiXVriHX, T)), h(X, T) = B^T-B^AX 



This can be simplified as: 

/Ti 2T 2 | R Y X 2 \ 

h(X, Y) 



Rx Rx 
T 2 



Rx 



Ti R Y X 3 
\Rx Rx Rx 



Rx Rx-x 2 

f T (T 1: T 2 )=J2Y, fx(X2,X 3 )f Y (h(X 2 ,X 3 , T u T 2 )) 

x 2 =o X 3 =0 

Computing this pmf on all the possible values of T 2 ) 
is prohibitively time consuming. Computational optimiza- 
tions can be used to speed up the computations of the 
probability mass function. We list several optimization 
techniques below. The first optimization is to transform 
the pmf to be symmetric in T 2 ), which reduces the 
computational burden by half. The original test statis- 
tics Ti and T 2 are T x = (R Y X 2 - R X Y 2 ) + 2(R Y X 3 - R X Y 3 ) 
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and T 2 = (R Y X 2 - R X Y 2 ) + (R Y X 3 - R X Y 3 ), respectively. The 
joint pmf of T 2 ) is a one-to-one function of the joint 
distribution of any two orthogonal linear combinations of 
I\ and T 2 . So if we transform the test statistics T x and 
T 2 into 

Z x = (R Y X 3 -R X Y 3 ), 

Z 2 = (R Y X 2 -R X Y 2 ), 

the resulting pmf of (Z lf Z 2 ) is a one-to-one function of 
the pmf of (7x, T 2 ). Hence, the p-value obtained will be 
the same when using (Z 1} Z 2 ) instead of (T lf T 2 ). The 
resulting pmf of (Z^ Z 2 ) can be derived using the same 
method as with (7\, T 2 ). 

The next computational optimization is to identify the 
values that can be taken by (Z 1} Z 2 ). The number of 
values (Zi, Z 2 ) can take are finite and represented by the 
solution space of the equations 

Zi = (R Y X 3 -R X Y 3 ), 

Z 2 = (R Y X 2 -R X Y 2 ), 

which depends on the values of R x and R Y These equa- 
tions are called linear Diophantine equations and have 
an infinite number of solutions [19]. But in our case we 
have multiple constraints on the equations, which re- 
duce the solution space to a finite number of solutions. 
The constraints are 

1. X 3) Y 3) X 2 and Y 2 are integers 

2. X 3 , Y 3i X 2 and Y 2 > 0 

3. X 3 +X 2 <R X 

4. Y 3 +Y 2 <R Y 

On the basis of these four constraints the solution 
space can be calculated. While the exact solution space 
could not be found, it follows a pattern that can be 
enumerated. 

Figure 1 depicts the pmf of the scenario with R x = 19 
and R Y =2 where a pattern of six triangles can be visual- 
ized from the figure. Similarly, Figure 2 depicts the pmf 
of the scenario with R x = 20 and R Y = 3, where a pattern 
of ten triangles can be visualized from the picture. This 
trend can be generalized for all values of R x and R Y 

Generalizing the above scenario, there are [1 + 2 + • • • + 
(Ry + 1) = ( Ry+1 ^ Ry + 2 ^ triangles for the solution space. 

In each triangle, there are [1 + 2+ h (Rx + 1) — 

(r x +i)(Rx+2) ^ e i emen ts that correspond to all possible 
combinations of X 3 + X 2 < R x . In each triangle, the 
values of Y 3 and Y 2 are constant and the (*r+iX*r+2) 
triangles correspond to all possible combinations of 
Y 3 + Y 2 < R Y which make up the whole solution space. 



Another important fact is that these triangles may 
overlap, reducing the solution space, which is depicted 
in Figures 3 and 4. Figure 3 depicts the pmf of the sce- 
nario with R x = 10 and R Y =2 where a pattern of six tri- 
angles can be visualized from the figure. The overlap of 
the triangles can be observed when compared to Figure 1. 
Figure 4 depicts the pmf of the scenario with R x =5 and 
R Y =5 where a pattern of 21 triangles can be visualized 
from the figure, where most of the triangles are overlap- 
ping one another. The additional computational burden is 
to determine where the solution space triangles overlap 
and how many triangles are overlapping at a particular lo- 
cation. This is a function of the greatest common divisor 
(GCD) of R x and R Y . If R x and i? r are co-prime (GCD=1), 
only three triangles overlap at a single point (Z 1 = 0, 
Z 2 = 0) which requires no additional computation. 
When R x and R Y are not co-prime, the triangles over- 
lap at multiples of the GCD of R x and R Y . In this sce- 
nario, multiple values of X 3) Y 3i X 2 , and Y 2 contribute 
to the same (Z ly Z 2 ), 

In an ideal scenario, the total number of computations 

required to compute the pmf of (Z 1} Z 2 ) is (%+1) 2 (i?r+2) 
{R x +i){Rx+2) „ R}&l t w hich can be computed in approxi- 
mately 15 minutes for R x = 1000 and R Y = 1000 using a 
computer with a 3.4-GHz processor and 8 GB of 
RAM. However, the amount of storage required for 
the solution space far exceeds the hardware capabil- 
ities available. In light of this limitation, computa- 
tional optimizations should be employed to avoid 
storing the whole solution space. This limitation leads 
to three possible scenarios: 

1. GCD(R X ,R Y ) = 1 

2. GCD(R X ,R Y )=R X = R Y 

3. GCD(R X , R Y ) < min(R x , R Y ) 

Scenario 1 

When R x and R Y are co-prime, the triangles only overlap 
at a single point (Z\ = 0,Z 2 = 0); therefore, we can inde- 
pendently evaluate each of the possible values of the solu- 
tion space. The p-value is the probability of obtaining a 
test statistic at least as extreme as the one observed, so we 
evaluate the probabilities of each of the possible values of 
the test statistics one at a time. Hence, the p-value is the 
sum of all the probabilities of test statistics that are lower 
than the probability of the observed test statistic. Using 
this procedure there is no need to store any data, which 
leads to faster computation of the p-value from the joint 
distribution. 

Scenario 2 

When R x and R Y are equal, most of the triangles overlap 
with each other. But a pattern has been observed in this 
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scenario, which is shown in Figure 5, where R x = 10 and 
R Y = 10. As seen in Figure 4, the solution space is very 
sparse and only requires computation of the colored cells. 
The possible solution space is spaced R x apart. So if we 
condense the possible solution space, the solution space is 
as shown in Figure 5. Figure 5 shows the number of trian- 
gles overlapping at each point in the solution space. Only 
half of the matrix needs to be computed, as the other half 
is symmetric. The algorithm to compute the p-value is as 
follows. 

Algorithm: 

1. Let R x = Ry=R- The solution space can then be 
constrained to a matrix with 2R + 1 rows and 2R + 1 
columns. Let the center of the matrix correspond to 
the test statistic (Z 1 = 0,Z 2 = 0). 

2. Now, as we can see from Figure 5, we need to 
compute the colored cells in quadrants 3 and 4. In 
quadrant 3, the cells with the same number of 
overlapping triangles are placed diagonally, and in 
quadrant 4, they are placed horizontally and then 
vertically. We exploit the pattern that follows from the 
same number of triangles overlapping at a particular 
cell. 

3. For i=l:R start at (Z Y = -(R- i\ Z 2 = - 1). Find the 
possible combinations of X 3 , Y 3 , X 2 and Y 2 that 
contribute to the cell corresponding to (Z x = -(R-i), 
Z 2 = - 1). Compute the probabilities for the cells 
along the diagonal path in quadrant 3, until Z x = 0. 



Here X 3 and X 2 remain the same; hence, it is trivial to 
compute the probabilities for each cell. 

4. Then in quadrant 4, compute the probabilities for the 
cells along the horizontal path until Z 1 = R-(i- 1); 
here X 3 remains the same and X 2new = X 2 + Z 2 . 

5. Then continue vertically until Z 2 = 0; here X 3 and 
X 2 remain the same. 

This algorithm reduces the computational burden by 
computing the possible combinations of X 3 , Y 3 , X 2 and 
Y 2 that contribute to all the cells only R times, as op- 
posed to computing once for each cell (approximately 
4R 2 times). 

Scenario 3 

This is the general scenario where GCD(R X , Ry) < min 
(R X ,R Y ). Several patterns that can be used to reduce the 
computational burden that could be applied for a par- 
ticular GCD were found, but these could not be general- 
ized to all the possible situations. We instead use a 
straightforward approach to determine the p-value for 
each of the possible solutions for (Z 1} Z 2 ). The algorithm 
is as follows: 

1. For each possible (Z 1} Z 2 ) compute the triangles 
that contribute to this particular point. 

2. Add up the probabilities of each of the elements of 
these triangles to compute the p-value of that par- 
ticular (Z lt Z 2 ). 
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